App review analysis¶
This is an analysis of a Swedish 'buy now, pay later' company's app on the Google Play Store¶
To scrape reviews and scores, the google_play_scraper was employed (https://pypi.org/project/google-play-scraper/)¶
And this is my attempt to visualise what people are saying about the app¶
The over-arching plan for this analysis is to understand:
- Can the reviews be quantified?
- Are we able to gauge opinions?
- Can we visualise items or topics of interest in reviews?
- Specifically, can we visualise it in a way that is engaging, fun and easy to understand?
Basic cleaning, merging and concatenating¶
import pandas as pd
import numpy as np
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from collections import Counter
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import seaborn as sns
from langdetect import detect, LangDetectException
import string
df_gb = pd.read_csv('app-reviews_en-gb.csv', sep='|')
df_us = pd.read_csv('app-reviews_en-us.csv', sep='|')
Duplicates within df_gb:
Empty DataFrame
Columns: [reviewId, userName, userImage, content, score, thumbsUpCount, reviewCreatedVersion, at, replyContent, repliedAt, appVersion]
Index: []
----------------------------------------------------------------------
Duplicates within df_us:
Empty DataFrame
Columns: [reviewId, userName, userImage, content, score, thumbsUpCount, reviewCreatedVersion, at, replyContent, repliedAt, appVersion]
Index: []
----------------------------------------------------------------------
Duplicates between the DataFrames:
Empty DataFrame
Columns: [reviewId, userName_x, userImage_x, content_x, score_x, thumbsUpCount_x, reviewCreatedVersion_x, at_x, replyContent_x, repliedAt_x, appVersion_x, userName_y, userImage_y, content_y, score_y, thumbsUpCount_y, reviewCreatedVersion_y, at_y, replyContent_y, repliedAt_y, appVersion_y]
Index: []
[0 rows x 21 columns]
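The cell that produced the output above isn't shown; a minimal sketch of how such checks might look (column names taken from the output; the between-frame check is assumed to be an inner merge on `reviewId`, which would explain the `_x`/`_y` suffixes and 21 columns):

```python
import pandas as pd

def report_duplicates(df_gb, df_us):
    # Rows duplicated within each DataFrame (keep=False flags every copy)
    within_gb = df_gb[df_gb.duplicated(keep=False)]
    within_us = df_us[df_us.duplicated(keep=False)]
    # Rows present in both frames, matched on the review identifier
    between = df_gb.merge(df_us, on='reviewId', how='inner')
    print("Duplicates within df_gb:", within_gb, sep='\n')
    print("-" * 70)
    print("Duplicates within df_us:", within_us, sep='\n')
    print("-" * 70)
    print("Duplicates between the DataFrames:", between, sep='\n')
    return within_gb, within_us, between
```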
result_df = pd.concat([df_gb, df_us], ignore_index=True)
duplicates_in_result = result_df[result_df.duplicated(subset=['reviewId', 'userName', 'appVersion'], keep=False)]
print("Duplicates within result:", len(duplicates_in_result))
Duplicates within result: 59368
result_df_no_duplicates = result_df.drop_duplicates(subset=['reviewId', 'userName', 'appVersion'], keep='first')
print("Shape of df after removing duplicates:", result_df_no_duplicates.shape)
Shape of df after removing duplicates: (30316, 11)
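One subtlety worth noting: `duplicated(keep=False)` flags every copy of a repeated row, whereas `drop_duplicates(keep='first')` removes only the extras, which is why the flagged count (59,368) is larger than the number of rows actually dropped. A toy example:

```python
import pandas as pd

toy = pd.DataFrame({'reviewId': ['a', 'a', 'b', 'c', 'c', 'c']})

# keep=False marks every row that has at least one duplicate
flagged = toy[toy.duplicated(subset=['reviewId'], keep=False)]
print(len(flagged))  # 5 rows: both 'a's and all three 'c's

# keep='first' retains one copy of each, dropping only the repeats
deduped = toy.drop_duplicates(subset=['reviewId'], keep='first')
print(len(deduped))  # 3 rows: 'a', 'b', 'c'
```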
Initialising the main df for use throughout¶
df = result_df_no_duplicates
Sentiment analysis¶
How does the rating of an app affect people's sentiment when writing a review?¶
Using TextBlob we can perform basic natural language processing while also:¶
- Calculating a sentiment score
- -1.0 to 1.0 (Negative - Positive) with 0.0 being Neutral
- We can group sentiment by score and see how people talk about the app at each rating
from textblob import TextBlob
sdf = df.copy()
def calculate_sentiment(text):
return TextBlob(text).sentiment.polarity
# Calculate sentiment polarity for each review
sdf['sentiment'] = sdf['content'].dropna().apply(calculate_sentiment)
# Group by rating and calculate average sentiment
avg_sentiment_by_rating = sdf.groupby('score')['sentiment'].mean().reset_index()
avg_sentiment_by_rating
| | score | sentiment |
|---|---|---|
| 0 | 1 | -0.043658 |
| 1 | 2 | 0.028367 |
| 2 | 3 | 0.082904 |
| 3 | 4 | 0.305454 |
| 4 | 5 | 0.485364 |
plt.figure(figsize=(10, 5))
plt.bar(avg_sentiment_by_rating['score'], avg_sentiment_by_rating['sentiment'], color='pink')
plt.xlabel('Score')
plt.ylabel('Average Sentiment')
plt.title('Average Sentiment by Score')
plt.xticks(avg_sentiment_by_rating['score'])
plt.grid(axis='y')
plt.show()
Some findings¶
- Reviewers on average do not take a completely negative tone when giving the app a bad score (1)
- Every score from 2 and above is positive in some way, shape or form
- A score of 5 (sentiment: 0.49) is only moderately positive on average
There seems to be a clear correlation between review sentiment and app score¶
What's the spread of scoring like?¶
Polar scoring¶
- People either love it or hate it
- Despite the high number of score == 1 reviews, it doesn't necessarily mean people are talking negatively about the app
- The app likely misses a feature the reviewer wants, or minor bugs make the experience less than favourable
- It may also speak to human psychology, in that reviewers would perhaps prefer a binary (like/dislike) system
plt.figure(figsize=(10, 6))
ax = sns.countplot(data=sdf, x='score')
for p in ax.patches:
ax.annotate(f'{int(p.get_height())}', (p.get_x() + p.get_width() / 2., p.get_height()),
ha='center', va='baseline', fontsize=8, color='black', xytext=(0, 5),
textcoords='offset points')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.title('App Ratings')
plt.xticks(range(0, 5), labels=[1, 2, 3, 4, 5])
plt.show()
Does sentiment and score correlate generally over time?¶
non_nan_count = df['content'].notna().sum()
print("Number of non-NaN rows in 'content' column:", non_nan_count)
Number of non-NaN rows in 'content' column: 30316
tdf = df.copy()
tdf['at'] = pd.to_datetime(tdf['at'], format='%Y-%m-%d %H:%M:%S')
tdf['content'] = tdf['content'].fillna("")  # avoids the chained-assignment warning from inplace=True on a column
tdf['sentiment'] = tdf['content'].apply(lambda x: TextBlob(str(x)).sentiment.polarity)
tdf[['rating', 'sentiment']] = tdf[['score', 'sentiment']].apply(pd.to_numeric, errors='coerce')
tdf.set_index('at', inplace=True)
# Resample by day and calculate mean for rating and sentiment
tdf_resampled = tdf[['rating', 'sentiment']].resample('D').mean().reset_index()
# Calculate EMAs for both rating and sentiment
ema_span = 30.417 # Approximately one month
tdf_resampled['ema_rating'] = tdf_resampled['rating'].ewm(span=ema_span).mean()
tdf_resampled['ema_sentiment'] = tdf_resampled['sentiment'].ewm(span=ema_span).mean()
plt.figure(figsize=(15, 7))
ax1 = plt.gca()
ax2 = ax1.twinx()
ax1.plot(tdf_resampled['at'], tdf_resampled['ema_rating'], label='EMA Rating', color='blue')
ax1.set_xlabel('Review Date')
ax1.set_ylabel('Average Rating', color='blue')
ax1.tick_params(axis='y', labelcolor='blue')
# Plot EMA of sentiment on the secondary y-axis
ax2.plot(tdf_resampled['at'], tdf_resampled['ema_sentiment'], label='EMA Sentiment', color='green')
ax2.set_ylabel('Average Sentiment', color='green')
ax2.tick_params(axis='y', labelcolor='green')
plt.title(f'Average Rating and Sentiment Over Time with EMA (Span = {ema_span} days)')
plt.grid(True)
plt.tight_layout()
ax1.legend(loc='upper left')
ax2.legend(loc='upper right')
plt.show()
Sentiment and ratings/score are very closely correlated over time¶
- There is a steep decrease in the number of reviews before 2019
- This affects the quality of the graph and insights prior
- Even in the noisy period prior to 2019, there is still incredibly good alignment between rating and sentiment
This shows that rating/scores can be used as an indicator of sentiment¶
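The visual alignment could also be quantified. As a sketch (not part of the original analysis), the Pearson correlation between the daily mean series gives a single summary number; `tdf_resampled` from the cell above is the assumed input:

```python
import pandas as pd

def rating_sentiment_correlation(resampled: pd.DataFrame) -> float:
    """Pearson correlation between daily mean rating and sentiment,
    ignoring days with no reviews (NaN after resampling)."""
    clean = resampled[['rating', 'sentiment']].dropna()
    return clean['rating'].corr(clean['sentiment'])
```

Called as `rating_sentiment_correlation(tdf_resampled)`, a value near 1.0 would confirm the close tracking seen in the plot.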
Some thoughts¶
Scores/Ratings are inherently quantitative, making them straightforward to aggregate, compare, and analyze statistically. They offer a clear, albeit simplistic, measure of user sentiment.
- Scores provide a direct measure of opinion, but they lack context and the nuances of why a user might have given a particular rating. Two users might give the same score for entirely different reasons.
Sentiments are inherently qualitative. Analysis involves evaluating the text of a review to determine the reviewer's subjective feelings or attitudes.
- It can help reveal the reasons behind a user's score or even offer insights into aspects of the product or service that weren't explicitly rated.
- Sentiment analysis can be more subjective and depends heavily on the quality of the analysis tools. Ambiguity and varying expressions of sentiment across different cultures and languages can make accurate sentiment analysis challenging.

Comparing scores and sentiment: scores provide a quick, at-a-glance view of user opinions that is easy to quantify. In contrast, sentiment analysis delves deeper into the "why" behind the scores, offering richer, qualitative insights.
The above analysis shows that Scores/Ratings can be used as a pseudo-indicator of Sentiment for this particular app during the period interrogated.
tdf.reset_index(inplace=True)
year_count = tdf['at'].dt.year.value_counts().sort_index()
year_count
at
2018      55
2019    3584
2020    5267
2021    9044
2022    8354
2023    4012
Name: count, dtype: int64
Thumbs up¶
- People can give reviews a 'thumbs up'
- We'll make the assumption that a 'thumbs up' 👍 is an agreement to the review
average_thumbs_up = df.groupby('score')['thumbsUpCount'].mean()
plt.figure(figsize=(10, 6))
plt.bar(average_thumbs_up.index, average_thumbs_up.values, color='blue')
plt.xlabel('Rating')
plt.ylabel('Average Thumbs Up Count')
plt.title('Average Thumbs Up Count per Rating')
plt.xticks(average_thumbs_up.index)
plt.show()
People tend to agree with a review when it is less than positive¶
We'll call a score of 2 the 'elbow point'¶
The result above suggests a few critical insights into user behaviour towards the app:
Validation of Negative Experiences: Users tend to agree more frequently with lower-score reviews.
- This could indicate a shared sentiment of dissatisfaction among a substantial portion of the app's user base.
Critical Engagement: Users might be more inclined to engage with negative reviews as they seek validation for their own experiences.
Opportunities for Improvement: From a developer or app owner's perspective, the prominence of thumbs-up on lower-score reviews serves as a critical feedback loop.
- It can highlight areas needing urgent attention and improvement. Addressing these commonly agreed-upon issues could significantly enhance user satisfaction and overall app perception.
Community and Empathy: The act of agreeing with reviews, especially negative ones, underscores a community aspect where users feel a sense of solidarity.
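As a rough sketch of how the 'elbow point' could be found programmatically (this helper is illustrative, not part of the original analysis), the per-score averages in `average_thumbs_up` can be scanned for the steepest drop between consecutive scores:

```python
import pandas as pd

def elbow_point(avg_thumbs_up: pd.Series) -> int:
    """Return the score at which average thumbs-up falls most sharply
    relative to the previous score (series assumed indexed 1-5)."""
    ordered = avg_thumbs_up.sort_index()
    drops = ordered.diff()      # change versus the previous score
    return int(drops.idxmin())  # score with the steepest decrease
```

For this data, `elbow_point(average_thumbs_up)` would be expected to land on the score 2 noted above.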
import gensim
from gensim import corpora
from gensim.models import LdaModel
import pyLDAvis.gensim_models
ldf = df.copy()
# download tokeniser models and stop words for multiple languages as a safety
nltk.download('punkt')
nltk.download('stopwords')
stop_words = set(nltk.corpus.stopwords.words(['english', 'german', 'spanish', 'swedish']))
# additional words to exclude - these words muddle topics
exclude_words = {'ca', 'app', 'ap'}
# tokenize and clean text
tokenized_data = []
for review in ldf['content']:
if not isinstance(review, str):
continue
try:
if detect(review) == 'en': # detect the language of the review
tokens = nltk.word_tokenize(review.lower())
tokens = [word for word in tokens if word.isalpha() and word not in stop_words and word not in exclude_words]
tokenized_data.append(tokens)
except LangDetectException:
continue # skip the review if language can't be detected
# create a dictionary and corpus
dictionary = corpora.Dictionary(tokenized_data)
corpus = [dictionary.doc2bow(text) for text in tokenized_data]
# generate the LDA model with num_topics calculated in def compute_coherence_values
lda_model = LdaModel(corpus, num_topics=7, id2word=dictionary, passes=15, random_state=1)
lda_display = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)
topic_counter = np.zeros(7)
# Go through the corpus and get the topic distribution for each document
for doc in corpus:
topic_distribution = lda_model.get_document_topics(doc)
for topic, proportion in topic_distribution:
topic_counter[topic] += proportion
# Normalize the counts to get proportions
topic_proportions = topic_counter / topic_counter.sum()
plt.figure(figsize=(12, 6))
plt.bar(range(1, len(topic_proportions) + 1), topic_proportions)
plt.xlabel('Topic Number')
plt.ylabel('Proportion')
plt.title('Topic Proportions')
plt.show()
from gensim.models.coherencemodel import CoherenceModel
# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=tokenized_data, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print(f'Coherence Score: {coherence_lda}')
Coherence Score: 0.6141385120585212
The below function allows us to determine the best number of topics¶
- By calculating a coherence score, we can tweak the model to get the best results
- It's an incredibly slow process; do not run it more than once 😅
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
coherence_values = []
model_list = []
for num_topics in range(start, limit, step):
model = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=15)
model_list.append(model)
coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
coherence_values.append(coherencemodel.get_coherence())
return model_list, coherence_values
# Function call
model_list, coherence_values = compute_coherence_values(dictionary=dictionary, corpus=corpus, texts=tokenized_data, start=2, limit=40, step=4)
limit=40; start=2; step=4;
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(["coherence_values"], loc='best')
plt.minorticks_on()
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='gray')
plt.show()
Generate key words for each topic¶
top_topics = lda_model.show_topics(num_topics=7, num_words=20, formatted=False)
for i, topic in enumerate(top_topics):
print(f"Topic {i+1}:")
print(", ".join([word[0] for word in topic[1]]))
Topic 1:
card, time, get, let, use, tried, phone, even, account, ghost, email, try, number, one, waste, work, trying, cant, says, log
Topic 2:
pay, love, buy, great, get, way, need, payments, things, want, like, thank, much, later, recommend, able, best, time, helps, thanks
Topic 3:
love, great, easy, use, good, best, shopping, service, awesome, convenient, far, really, payment, shop, helpful, experience, payments, way, absolutely, brilliant
Topic 4:
credit, time, used, use, purchase, never, using, paid, good, power, better, even, purchases, afterpay, limit, approved, payment, always, made, make
Topic 5:
service, customer, company, help, support, bank, use, know, ever, people, never, worst, take, scam, issue, horrible, account, contact, bad, give
Topic 6:
payment, order, pay, money, purchase, card, make, account, payments, first, get, still, back, due, got, one, item, bank, days, never
Topic 7:
work, slow, update, works, open, keeps, working, load, like, needs, takes, fix, screen, please, even, see, trying, page, website, get
Pass the topics to a LLM to identify themes¶
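The hand-off to the LLM itself isn't shown; a minimal sketch of formatting the topic keywords into a single prompt might look like this (`build_theme_prompt` and the prompt wording are illustrative, not what was actually used):

```python
def build_theme_prompt(topics):
    """Format LDA topic keyword lists into one prompt for an LLM.

    `topics` is a list of keyword lists, one per topic, e.g. the output of
    lda_model.show_topics() reduced to plain words.
    """
    lines = ["Each numbered line lists the top keywords of one LDA topic",
             "from app-store reviews. Suggest a short theme for each:", ""]
    for i, words in enumerate(topics, start=1):
        lines.append(f"Topic {i}: {', '.join(words)}")
    return "\n".join(lines)
```

The resulting string can then be pasted into (or sent via API to) the model of choice.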
Topics and Their Themes¶
Topic 1: User Interface and Access Issues
- Keywords: card, time, get, let, use, tried, phone, even, account, ghost, email, try, number, one, waste, work, trying, cant, says, log
- Theme: This topic appears to focus on issues related to using the app, especially problems with logging in, account access, and functionality on mobile devices. "Ghost" might refer to temporary or virtual cards not working as expected, suggesting frustrations with financial transactions or account management.
Topic 2: Positive Feedback on Functionality and Convenience
- Keywords: pay, love, buy, great, get, way, need, payments, things, want, like, thank, much, later, recommend, able, best, time, helps, thanks
- Theme: Reviews in this topic seem to express appreciation for the app's payment features, especially for making purchases and managing payments over time. The repeated expressions of gratitude and recommendations indicate high user satisfaction with the app's convenience and ease of use.
Topic 3: Excellence in Shopping Experience
- Keywords: love, great, easy, use, good, best, shopping, service, awesome, convenient, far, really, payment, shop, helpful, experience, payments, way, absolutely, brilliant
- Theme: This topic highlights the app's excellence in providing a seamless shopping experience, emphasizing the ease of use, convenience, and customer service. Words like "awesome," "brilliant," and "best" suggest a very positive user perception, particularly regarding the shopping and payment functionalities.
Topic 4: Financial Features and Credit Management
- Keywords: credit, time, used, use, purchase, never, using, paid, good, power, better, even, purchases, afterpay, limit, approved, payment, always, made, make
- Theme: Focused on the financial aspects of the app, including credit use, purchase management, and comparisons with similar services like Afterpay. Issues with credit limits and approval processes are also touched upon, along with the reliability of making and managing payments.
Topic 5: Customer Service and Support Concerns
- Keywords: service, customer, company, help, support, bank, use, know, ever, people, never, worst, take, scam, issue, horrible, account, contact, bad, give
- Theme: Central to this topic are criticisms of customer service and support, with strong negative sentiment expressed through words like "worst," "scam," "horrible," and "bad." Users express frustration with how the company handles support issues, account problems, and interactions with the bank.
Topic 6: Transaction and Payment Issues
- Keywords: payment, order, pay, money, purchase, card, make, account, payments, first, get, still, back, due, got, one, item, bank, days, never
- Theme: This topic deals with specific grievances related to transactions, payments, and financial dealings through the app. There are mentions of delays, problems with orders and refunds, and difficulties in making or receiving payments.
Topic 7: Technical Performance and Usability
- Keywords: work, slow, update, works, open, keeps, working, load, like, needs, takes, fix, screen, please, even, see, trying, page, website, get
- Theme: Concerns about the app's technical performance, including issues with updates, slowness, loading times, and general usability problems. Users are asking for improvements and fixes to enhance the app's functionality and user experience.
Wordcloud per score/rating¶
import pandas as pd
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.util import ngrams
import nltk
import random
# Set a random seed for reproducibility
random.seed(42)
nltk.download('punkt')
nltk.download('stopwords')
ddf = df.copy()
custom_exclude = ['app', 'ap', 'ca', 'appen']
colors = ["#0000FF", "#FF69B4"]
cmap = LinearSegmentedColormap.from_list("custom", colors, N=256)
# Generate word cloud
def generate_word_cloud(text):
wordcloud = WordCloud(width=800, height=400, background_color='white', colormap=cmap).generate_from_frequencies(text)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
stop_words = set(stopwords.words('english'))
grouped_reviews = ddf.groupby('score')['content'].apply(lambda x: ' '.join(x.dropna().astype(str))).reset_index()
# Generate word clouds for each rating group
for _, row in grouped_reviews.iterrows():
print(f"Word Cloud for Rating {row['score']}")
tokens = [word for word in word_tokenize(row['content'].lower()) if word not in stop_words and word not in custom_exclude and word.isalpha()]
word_freq = Counter(tokens)
bigrams = list(ngrams(tokens, 2))
bigram_freq = Counter(map(lambda x: ' '.join(x), bigrams))
merged_freq = word_freq + bigram_freq
generate_word_cloud(merged_freq)
Word Cloud for Rating 1
Word Cloud for Rating 2
Word Cloud for Rating 3
Word Cloud for Rating 4
Word Cloud for Rating 5
Wordcloud per topic¶
colors = ["#0000FF", "#FF69B4"]
cmap = LinearSegmentedColormap.from_list("custom", colors, N=256)
# Generate word cloud from word frequencies
def generate_word_cloud_topic(word_freqs):
wordcloud = WordCloud(width=800, height=400, background_color='white', colormap=cmap).generate_from_frequencies(word_freqs)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
top_topics = lda_model.show_topics(num_topics=7, num_words=200, formatted=False)
for topic_num, words in top_topics:
print(f"Word Cloud for Topic {topic_num+1}")
word_freqs = {word: prob for word, prob in words}
generate_word_cloud_topic(word_freqs)
Word Cloud for Topic 1
Word Cloud for Topic 2
Word Cloud for Topic 3
Word Cloud for Topic 4
Word Cloud for Topic 5
Word Cloud for Topic 6
Word Cloud for Topic 7
Final Thoughts¶
- The goals of this analysis were:
Can the reviews be quantified?
- Reviews can be quantified and some basic takeaways can be extracted. First, reviewers of this app tend towards a score of either 1 or 5.
- It would be interesting to see whether this holds for other apps' reviews, but here the 1-to-5 scale is used much more like a binary (like/dislike) rating scale.
- Despite the scale used, 54% of reviews in the sample were scored 5.
- There seems to be a trend toward people 'liking' or giving a thumbs-up to reviews with a lower score, the assumption being that it validates a user's negative experience quickly and efficiently.
Are we able to gauge opinions?
- Yes, and in multiple ways.
- We can use Latent Dirichlet Allocation (LDA) to find thematic structure amongst reviews.
- LDA is a generative probabilistic model which requires fine tuning and thoughtful interpretation of results as topics are not labelled.
- Topics could then be passed to an LLM (ChatGPT) to create themes based on the input.
- Many of the topics revolved around themes expected when dealing with a 'buy now, pay later' app.
- This further helped confirm that the topics generated were meaningful.
- Topics 2 and 3 comprised about 40% of the total proportion of topics.
- Topics 2 and 3 revolved around 'Positive Feedback on Functionality and Convenience' and 'Excellence in Shopping Experience'.
- Both of which are positive topics and correlate nicely with our score 5 ratings.
- Topics 1, 4 and 6 related to areas such as UX and payments.
- Almost always negative themes mentioning issues, delays and reliability.
- Topics 5 and 7 are also UX-related, covering experience both within and outside of the app.
- Also negative topics in regards to customer service, banking issues and general 'buggy' app issues.
Can we visualise items or topics of interest in reviews?
- The idea of visualising text and themes is nightmarish, but wordclouds are a fun and interesting way to do so.
- Visualising the words used, with each word's size scaled by its frequency, really helps to drive home the message of these types of analyses.
- Word clouds could be generated for various groupings (e.g. scores or topics).
- I believe this gives those without a background in data analysis a solid understanding of reviewer/user thoughts without the need for bar or line graphs.